Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Linux] Prevent GC from running during process teardown #57832

Merged
merged 3 commits into from
Mar 24, 2025

Conversation

d-netto
Copy link
Member

@d-netto d-netto commented Mar 19, 2025

Context

We send a signal 15 to shutdown our servers.

We noticed that some of our servers that receive the termination signal are segfaulting in GC, which leads to false alarms in our internal monitors that track GC-related crashes.

Hypothesis

We suspect this pathological case may be happening:

  • Process receives signal 15, which is captured by the signal listener thread.
  • Signal listener initiates process' teardown (e.g. through raise).
  • IIRC such operation is not atomic in Linux, i.e. the kernel will gradually kill the threads, but it's possible for us to spent a few ms in a state where part of the threads in the system are alive, and part have already been killed (this point needs some confirmation).
  • With part of the process alive, and part of the process dead, we try to enter a GC, see a bunch of Julia data structures in an intermediate/corrupted state, which leads us to crash when running the GC.

Mitigation

Since our main goal is to get rid of the GC crashes that happen around server shutdown, we believe that it would be sufficient to just prevent the last bullet point. I.e. we prevent the system from even running a GC when we're about to kill the process, and we wait for any ongoing GC to finish.

Co-debugged with @kpamnany.

@d-netto d-netto requested a review from vtjnash March 19, 2025 21:14
@d-netto d-netto added system:linux Affects only Linux GC Garbage collector labels Mar 19, 2025
@d-netto d-netto force-pushed the dcn-kp-no-gc-on-teardown branch from 6be4221 to 71ba5fa Compare March 19, 2025 23:37
@d-netto d-netto requested a review from gbaraldi March 19, 2025 23:38
@d-netto d-netto force-pushed the dcn-kp-no-gc-on-teardown branch 3 times, most recently from e1e3d4f to 4d64c18 Compare March 20, 2025 17:58
@gbaraldi
Copy link
Member

As I discussed tuesday with @d-netto. Master already waits for the GC to finish to exit, here

julia/src/init.c

Lines 343 to 344 in ad61241

if (ct)
(void)jl_gc_safe_enter(ct->ptls); // park in gc-safe
. So this doesn't make too much sense as a change

@vtjnash
Copy link
Member

vtjnash commented Mar 21, 2025

We should do this just before sending the signal to stop the thread though, otherwise we start running code concurrently with the GC. Can resume the GC immediately after though

@gbaraldi
Copy link
Member

I'm confused, the signal handling thread doesn't touch the GC at all (I don't think it even gets adopted), and the exiting thread will run finalizers and such so its not gc safe

@d-netto d-netto force-pushed the dcn-kp-no-gc-on-teardown branch from 4d64c18 to 2e2d508 Compare March 24, 2025 11:04
@d-netto
Copy link
Member Author

d-netto commented Mar 24, 2025

Tested this, and the suggestion from here also fixes the false alarms' issue.

@d-netto d-netto force-pushed the dcn-kp-no-gc-on-teardown branch from 2a2c495 to 5c9a592 Compare March 24, 2025 18:58
Copy link
Member

@vtjnash vtjnash left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Could you make the same change on Windows so we keep those platforms consistent?

@d-netto
Copy link
Member Author

d-netto commented Mar 24, 2025

Seems like we don't call jl_exit_thread0 in crt_sig_handler though?

@vtjnash
Copy link
Member

vtjnash commented Mar 24, 2025

oh, good point, Windows' doesn't really have signals anyways so it is not relevant there

@vtjnash vtjnash added the merge me PR is reviewed. Merge when all tests are passing label Mar 24, 2025
@d-netto d-netto merged commit e1e3a46 into master Mar 24, 2025
6 of 8 checks passed
@d-netto d-netto deleted the dcn-kp-no-gc-on-teardown branch March 24, 2025 23:20
@d-netto d-netto added backport 1.12 Change should be backported to release-1.12 and removed merge me PR is reviewed. Merge when all tests are passing labels Mar 24, 2025
KristofferC pushed a commit that referenced this pull request Mar 25, 2025
## Context

We send a signal 15 to shutdown our servers.

We noticed that some of our servers that receive the termination signal
are segfaulting in GC, which leads to false alarms in our internal
monitors that track GC-related crashes.

## Hypothesis

We suspect this pathological case may be happening:

- Process receives signal 15, which is captured by the signal listener
thread.
- Signal listener initiates process' teardown (e.g. through `raise`).
- IIRC such operation is not atomic in Linux, i.e. the kernel will
gradually kill the threads, but it's possible for us to spent a few ms
in a state where part of the threads in the system are alive, and part
have already been killed (this point needs some confirmation).
- With part of the process alive, and part of the process dead, we try
to enter a GC, see a bunch of Julia data structures in an
intermediate/corrupted state, which leads us to crash when running the
GC.

## Mitigation

Since our main goal is to get rid of the GC crashes that happen around
server shutdown, we believe that it would be sufficient to just prevent
the last bullet point. I.e. we prevent the system from even running a GC
when we're about to kill the process, and we wait for any ongoing GC to
finish.

Co-debugged with @kpamnany.

(cherry picked from commit e1e3a46)
@KristofferC KristofferC mentioned this pull request Mar 25, 2025
27 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 1.12 Change should be backported to release-1.12 GC Garbage collector system:linux Affects only Linux
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants